Network Working Group                                        E. Levinson
Internet Draft: MIME/SGML              Accurate Information Systems, Inc
<draft-levinson-sgml-00.txt>                            October 19, 1993
Expires: April 1994
MIME Content-types for SGML Documents
This draft document is being circulated for comment. Please send
your comments to the author or to the ietf-822 mailing list
<ietf-822@dimacs.rutgers.edu>. If consensus is reached, this document
may be submitted to the RFC editor as a Proposed Standard protocol
specification for use with MIME.
Status of this Memo
This document is an Internet Draft; Internet Drafts are working
documents of the Internet Engineering Task Force (IETF), its
Areas, and its Working Groups. Note that other groups may also
distribute working documents as Internet Drafts.
Internet Drafts are draft documents valid for a maximum of six
months. They may be updated, replaced, or obsoleted by other
documents at any time. It is not appropriate to use Internet
Drafts as reference material or to cite them other than as a
"working draft" or "work in progress".
Please check the abstract listing in each Internet Draft
directory for the current status of this or any other Internet
Draft.
Abstract
This document specifies how a specific compound object, a complete
SGML document, is to be carried within a MIME message. MIME
provides a flexible mechanism for structuring RFC 822 message
bodies. To use that mechanism for compound documents requires
additional agreements on how the compound document is
represented and labelled within the message body. In addition,
this document specifies the requirements for using MIME to carry SGML
documents within a data stream in conformance with the SGML
Document Interchange Format (SDIF). That format provides a mechanism
for transferring one or more SGML documents. Subtypes are
proposed for the Multipart and Application content types to
support SGML documents and SDIF within MIME.
Compound documents, including SGML, consist of a number of
files, some of which may contain references to other files.
Explicit indications of the bindings between the sender's file
names and the MIME body parts are needed to re-bind the sender's
file names to ones on the recipient's system. A content
reference header field makes the bindings explicit.
1 Introduction
Many MIME [RFC-MIME] based mail User Agents can be readily
configured to display (and compose) standard message body content
types. These user agents invoke applications that correspond to
the particular content type. Standard content types exist for
data that consists of a single body part and there are mechanisms
to convey multiple body parts. However, there is no standard
mechanism for objects like compound documents that contain
multiple, inter-related body parts.
Compound documents are represented in various mark-up languages,
e.g. troff, text/enriched. This document provides a mechanism for
embedding, inside an Internet mail [RFC-822, RFC-MIME] message,
complete documents of one such markup language, the Standard
Generalized Markup Language (SGML) [ISO-SGML].
1.1 SGML
SGML is used in several communities to encode document structure
and layout. A rigorous description of SGML is left to [ISO-SGML];
Appendix A contains an unbelievably brief description of the SGML
elements relevant to this document. This document attempts to be
consistent with SGML terminology and usage. A complete SGML
document consists of an SGML declaration, a document type
definition (DTD), and a document instance. The document instance
may, recursively, contain subdocuments which consist of DTDs and
an instance. The applications that process SGML documents may
require the document parts to be individual files or combined in a
single file.
For a person or application (a recipient) to receive and display a
complete SGML document a precise definition for each of the SGML
document parts must be carried within the mail message. In the
sender's environment these parts may be references to standard
parts or specific files in the sender's file system. Further, a
DTD may reference other files, for example, images and graphics.
The identity of the document parts as well as the name and
contents of each file must be transferred. Sufficient information
must accompany the data for the recipient to transform the
sender's file name into an equivalent local reference.
1.2 SDIF
The SGML Document Interchange Format, SDIF, [ISO-SDIF] specifies
the structure for a data stream which contains one or more SGML
documents. SDIF is focused on transferring documents between
sites and does not include a requirement that the documents be
displayed as they are encountered. Users of mail based systems,
however, expect to have each mail item in a multipart message
displayed -- more precisely, ready for display -- when
encountered. This document shows how to meet both the SDIF and display
requirements.
1.3 Organization of this Memorandum
First a body part content type for a simple SGML document is
defined. The discussion of that content type explains the SGML
specific parameters and explores a number of the issues that arise
when transferring an SGML document from one system to another.
More complex documents are handled via a Multipart subtype.
Discussion of that subtype explores additional document transfer
issues. The discussion concludes by presenting the content types
required to create an SDIF conformant data stream.
2 Processing Model for MIME/SGML
Four issues must be addressed for the recipient's user agent to
display a complete SGML document: the various parts must be
specified, and file and command references on the sender's system
must be resolved to references on the receiver's system. Finally,
an appropriate application, an unpacker, must be in control to
unpack the MIME body parts containing the document and present
them to the display software. The controlling application is
discussed first and then the document parts, file references, and
command references.
2.1 Invoking the SGML Parser Application
MIME offers the possibility to add SGML capability to existing
mail user agents. To accomplish this with existing SGML viewers
and composers a process must be interposed between the SGML
application and the user agent to translate from MIME format to a
form acceptable to the local SGML application. This document uses the
notation in [ISO-SDIF] where the process creating the data stream
(here, the MIME message) is called the packer; the corresponding
one for the receiver, the unpacker.
Normally one expects a MIME capable user-agent to display each
body part in turn, usually in a depth first manner. For a
compound document the display must be deferred until all the body
parts are available to the application and are structured
according to the requirements of the SGML application. For example,
some SGML viewers expect the DTD and instance to be combined in a single
file, others expect them to be separate. "Available" means that the
files corresponding to the various document parts have been
instantiated on the receiver's system. Once instantiated, an SGML
viewer can be invoked. Similarly, an SGML composer will create
the respective parts using its own file structure. For example,
an SGML application may expect the DTD and document instance to be
in a single file. The packer separates these parts and
encapsulates them as separate MIME body parts.
2.2 Specifying the Document Parts
Different implementations of SGML parsers use different methods
for storing the SGML declaration, DTD, and document instance.
Consequently the unpacker may find these parts as separate body
parts or as a single part and must store them as the local
application requires. There are several ways to specify each part
of a complete SGML document. The declaration part may be a
default value and not included, a file which is included, or it
may be part of the document instance. It could also be a file
each correspondent already has.
An easy solution would be to require a standard form, perhaps a
single file, a concatenation of the declaration, DTD, and
instance. That would often require transferring much more
data than needed; often only the document instance is required.
The discussion so far assumes that there is only one file (or its
equivalent). While there may be many files, for SGML document
instances as opposed to files of image (or other) data, the
consideration here is only how to specify the document or
subdocument SGML elements. The next section considers other data.
Rather than require a standard form, this document permits the SGML
document parts to be specified as parameters. Thus a sender may
choose to send the declaration, DTD, and instance as a single file
or may choose to specify any of them as a parameter. If the SGML
declaration or the document type declaration is not specified as a
parameter, it may be in the message body; if it is not there either,
the recipient is free to apply a local default.
These parameters are provided for each document or subdocument
instance.
   sgml-parm      := *( ";" sgml-part "=" sgml-part-spec )
                     [ ";" "version" "=" iso-sgml-spec ]
                     [ ";" "created-with" "=" ref-or-tok ]
                     [ ";" "character-set" "=" charset ]
   sgml-part      := "instance" / "declaration"
                     / "dtd" / "fosi" / extension-token
   sgml-part-spec := file-token / sgml-public / extension-token
   sgml-public    := <An SGML PUBLIC identifier>
   iso-sgml-spec  := <The identifier of the SGML specification
                     to which the document conforms, e.g. ISO
                     8879-1986>
Sgml-parts specify the various parts of a complete document.
File-tokens are discussed in the next section. If one is used, that
file's contents will be contained in a body part and will be labelled
with a Content-Reference: field. Sgml-public values are identifiers
defined in [ISO-SGML] which represent well known files or
entities. The SGML parser is expected to resolve these references
on its own. Although the SGML definition provides for
associating location (local file system) information with public
data, this document does not support it. It is possible to provide
support for that capability in the unpacker.
The two parameters, version and created-with, are provided for
guidance to user agents. Version specifies the particular SGML
standard to which the document conforms. A user agent can use
this value to invoke the application appropriate to that version
of the standard. The created-with parameter provides guidance in
cases where inter-operability with respect to SGML may be a
problem. In those environments where users maintain several
SGML processors, this parameter can be used to invoke the
appropriate implementation.
The character-set parameter specifies the body part character
set. If not specified, the default is us-ascii.
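As an illustration of how a user agent might read these parameters,
the following Python sketch uses the standard library email package
to extract them from a Content-Type field. The function name
sgml_parameters and the sample message are assumptions made for this
example and are not part of this specification. The version value is
quoted in the sample because it contains a space.

```python
from email import message_from_string

# Part names taken from the sgml-part production above.
SGML_PARTS = ("instance", "declaration", "dtd", "fosi")


def sgml_parameters(raw_message):
    """Return the Content-Type parameters of a message as a dict."""
    msg = message_from_string(raw_message)
    found = {}
    # get_params() yields (name, value) pairs; the first pair is the
    # media type itself, so it is skipped.
    for name, value in msg.get_params(failobj=[])[1:]:
        found[name.lower()] = value
    # Default stated in the text when no character set is given.
    found.setdefault("character-set", "us-ascii")
    return found


example = (
    "Content-Type: Application/SGML;\n"
    ' dtd="//USA-DOD//DTD MIL-M-21742 911991//EN";\n'
    ' version="ISO 8879-1986"\n'
    "\n"
    "<!-- an SGML instance -->\n"
)
params = sgml_parameters(example)
# Keep only the parameters that name document parts.
parts = {k: v for k, v in params.items() if k in SGML_PARTS}
```

A user agent could then consult params["version"] or
params.get("created-with") when choosing which local SGML
application to invoke.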
2.3 Resolving File References
SGML permits the DTD to define document parts (entities) that a
document instance can reference for inclusion or interpolation.
The entities point to files that can contain SGML coded text, text
not to be interpreted, images, or other data. Within SGML there
are two types of file reference entities: SYSTEM and PUBLIC.
PUBLIC entities specify SGML document parts that are known to and
resolvable by SGML viewers and editors. The SYSTEM identifiers
refer to files in the local environment. In order for the
recipient's SGML application to properly process the document, the
file references must be resolvable in the recipient's environment.
Conceptually, one must replace each of the sender's file
references with a corresponding reference in the recipient's file
system.
There are two issues here. First, the sending user agent must
parse the document and identify the sender's file references.
Second, the internally referenced file will become a MIME body
part and the correspondence between the file name and the body
part must be preserved. This document applies the principle of "sender
makes right" to these issues and requires first, that the packer
converts all file references into a unique token containing only
US-ASCII characters. Second, those files will be a body part in a
multipart MIME message and the corresponding body part header must
contain a Content-Reference: field whose value contains the file's
token. Thus, the internal file name, now a token which can appear
in an 822 header, explicitly appears in the document and its
corresponding MIME body part using the Content-Reference: field.
When the unpacker stores the body part in the recipient's file
system it can convert the internal file references (tokens) into
valid local references.
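The "sender makes right" steps can be sketched in Python as follows.
This is a minimal illustration under stated assumptions, not a
conforming packer: it recognizes only general entity declarations
whose SYSTEM identifier is a quoted literal, and the token prefix
"sgml-ref" and both function names are invented for the example.

```python
import re

# Matches the SYSTEM identifier of a general entity declaration only;
# parameter entities and NOTATION declarations are left untouched.
ENTITY_SYSTEM = re.compile(r'(<!ENTITY\s+\S+\s+SYSTEM\s+)(["\'])(.*?)\2')


def tokenize_references(sgml_text, prefix="sgml-ref"):
    """Packer side: replace sender file names with US-ASCII tokens."""
    mapping = {}  # sender file name -> token

    def replace(match):
        file_name = match.group(3)
        if file_name not in mapping:
            mapping[file_name] = f"{prefix}-{len(mapping) + 1}"
        quote = match.group(2)
        return f"{match.group(1)}{quote}{mapping[file_name]}{quote}"

    return ENTITY_SYSTEM.sub(replace, sgml_text), mapping


def rebind_references(sgml_text, local_names):
    """Unpacker side: replace tokens with the recipient's file names."""

    def replace(match):
        quote = match.group(2)
        # Tokens with no local binding are left as they are.
        local = local_names.get(match.group(3), match.group(3))
        return f"{match.group(1)}{quote}{local}{quote}"

    return ENTITY_SYSTEM.sub(replace, sgml_text)


dtd = (
    '<!ENTITY fig1 SYSTEM "C:\\FIGS\\FIG1.GIF" NDATA gif>\n'
    '<!ENTITY chap SYSTEM "chap1.sgm">'
)
packed, tokens = tokenize_references(dtd)
unpacked = rebind_references(
    packed,
    {"sgml-ref-1": "/tmp/mail/fig1.gif", "sgml-ref-2": "/tmp/mail/chap1.sgm"},
)
```

Each token in the returned mapping would become the value of the
Content-Reference: field of the body part carrying that file.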
2.4 Processors for Non-SGML Data
Non-SGML data requires the SGML parser to invoke a processor to
format the data. The correspondence between the file name and the
application is contained in the type field of the SGML entity
declaration and the SGML notation declaration for that type.
The notation declaration contains the operating system command
string to invoke or launch the processor. That is, the string in
the notation declaration is an arbitrary command. There are two
problems with this situation: the command may only be valid in the
sender's environment and, if it is valid in the recipient's,
invoking that command is a security hazard.
Therefore, this document requires that any type used in an SGML
notation be a valid MIME content type (or an extension token) and
that the unpacker substitute a local string for the string in the
notation declaration.
3 The SGML Subtypes
A complete document may be a single instance in which all the
other document parts are defined by existing standards or private
agreements. It may also be a set of parts several of which must
be included in the MIME message. Two SGML subtypes are defined,
one for each of the content types Application and Multipart. Both
body part content types use the same parameters. The multipart
subtype is considered first; it is the general case. The
application subtype is a simplification for the case where the
multipart would contain a single part. It is also used to contain
SGML subdocument entities, that is, text with mark-up.
3.1 The Multipart/SGML Subtype
An SGML document is carried in a MIME message as a Multipart body of
subtype SGML (Content-Type: Multipart/SGML). The content-type
parameters specify each of the parts of the SGML document.
Additional parameters specify the software that created the
document and the applicable SGML standard.
In a complex document some of the SGML document parts are
references to standard parts and the others are file names. In the
latter case each file name token must appear in exactly one
Content-Reference: header in an enclosed body part. Inside the document
itself, the file names must be replaced by their tokens.
Thus a complete SGML document can appear as the following MIME
message.
   Content-Type: Multipart/SGML; instance=SSBradio;
      dtd=sgml-dtd-mtce-radio; boundary=tiger-lily

   --tiger-lily
   Content-type: Application/SGML
   Content-reference: SSBradio

   <! ... an SGML instance >
   --tiger-lily
   Content-type: Image/gif
   Content-reference: sgml-radio-figure-1

   ...
   --tiger-lily
   Content-type: Application/SGML
   Content-reference: sgml-dtd-mtce-radio

   <! ... a DTD that references the file Figure1>
   --tiger-lily--
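On the receiving side, an unpacker has to match the instance and dtd
parameters against the Content-Reference: fields of the enclosed body
parts. The Python sketch below shows one way this might be done with
the standard library email package. The helper name
index_by_reference and the shortened sample message are assumptions
made for illustration only.

```python
import email

RAW = """\
MIME-Version: 1.0
Content-Type: Multipart/SGML; instance=SSBradio;
 dtd=sgml-dtd-mtce-radio; boundary=tiger-lily

--tiger-lily
Content-Type: Application/SGML
Content-Reference: SSBradio

<!-- an SGML instance -->
--tiger-lily
Content-Type: Application/SGML
Content-Reference: sgml-dtd-mtce-radio

<!-- a DTD -->
--tiger-lily--
"""


def index_by_reference(msg):
    """Map each Content-Reference token to its enclosed body part."""
    parts = {}
    for part in msg.get_payload():
        token = part["Content-Reference"]
        if token is None:
            continue
        token = token.strip()
        # A token may label only one enclosed body part.
        if token in parts:
            raise ValueError(f"duplicate Content-Reference: {token}")
        parts[token] = part
    return parts


msg = email.message_from_string(RAW)
by_ref = index_by_reference(msg)
# Resolve the document parts named in the Content-Type parameters.
instance = by_ref[msg.get_param("instance")]
dtd = by_ref[msg.get_param("dtd")]
```

The unpacker would then write each part to a local file and hand the
resulting names to the SGML application.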
3.2 The Application/SGML Subtype
When transferring a file containing text and mark-up within a
Multipart/SGML message or when a complete SGML document can be
contained in a single message the content-type: Application/sgml
can be used.
   application-subtype := ("octet-stream" *stream)
                          / "postscript"
                          / ("sgml" *sgml-parm)
                          / extension-token
The following example shows a MIME message containing a document
instance which specifies a DTD.
   Content-Type: application/SGML;
      dtd="//USA-DOD//DTD MIL-M-21742 911991//EN"

   <! ... an SGML instance >
3.3 Character Set Considerations
It is expected that SGML documents will use the US-ASCII character
set. For documents not in the US-ASCII character set, the
charset= parameter of the Content-Type: field specifies the actual
character set. Note that the values of the charset parameter must
be registered with IANA, or be a mutually agreed upon
extension-token (e.g., charset=X-set).
Values contained in the MIME headers must be drawn from
[US-ASCII] and conform to [RFC-822]. Where the sender's file names do
not meet this requirement the conventions specified in [RFC-HDRC]
may be used.
4 The Content-Reference Header Field
The Content-Reference: header field provides the linkage between
file references within the SGML document and the MIME body parts.
It contains the unique file name token which represents the
sender's file name to which the body part corresponds. The
process that handles the Multipart/SGML body part will
use this value to convert internal file references into valid
references in the receiver's file system.
The syntax is:
   reference := "Content-Reference" ":" (token / quoted-string)
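A receiver needs to accept the field value in either form. The Python
sketch below checks a value against the MIME token production (any
printable US-ASCII character except space and the tspecials) or the
RFC 822 quoted-string production and returns the bare file name
token. The function name reference_token is an assumption made for
this example.

```python
import re

# MIME token: printable US-ASCII without space or the tspecials
# ( ) < > @ , ; : \ " / [ ] ? =
TOKEN = re.compile(r"[!#$%&'*+\-.0-9A-Z^_`a-z{|}~]+")
# RFC 822 quoted-string, with backslash-escaped characters allowed.
QUOTED = re.compile(r'"((?:[^"\\\r\n]|\\.)*)"')


def reference_token(value):
    """Return the file name token carried by a Content-Reference value."""
    value = value.strip()
    if TOKEN.fullmatch(value):
        return value
    match = QUOTED.fullmatch(value)
    if match:
        # Remove the backslash from each quoted-pair.
        return re.sub(r"\\(.)", r"\1", match.group(1))
    raise ValueError(f"not a token or quoted-string: {value!r}")


plain = reference_token(" SSBradio ")
quoted = reference_token('"figure 1.gif"')
```

A value such as a/b is rejected because "/" is a tspecial; a sender
would have to quote it.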
5 SDIF [ISO-SDIF] Data Streams
[This part needs work -- Ed]
SDIF is an interchange format standard for SGML documents
[ISO-SDIF]. It defines a data stream that may contain several SGML
documents. This section defines a Multipart subtype "SDIF" for an
SDIF data stream that contains one or more Multipart/SGML
documents. Messages that conform to the SDIF subtype will conform
to [ISO-SDIF].
Briefly, an SDIF data stream is a sequence of SGML documents and
their subdocument and external entities (cf. Appendix A). These
external entities are defined in the DTD and are referred to via
their SGML name in the document instance. The scope of an entity
name is the document or subdocument in which it is defined. Thus
names are not unique across documents and subdocuments. To
provide unique names within the SDIF data stream, each entity is
assigned a sequential number. Each SGML document or subdocument
structure in the SDIF stream lists the number of the first entity
it contains.
An SDIF data stream is encoded within a MIME message as a
Multipart/SDIF body part. It contains one to three body parts.
The first and last body parts are optional. They are labelled
with a content description field whose value is "related-documents-A"
and "related-documents-B" respectively and are Multipart/Mixed.
These multipart bodies contain only Multipart/SGML or
Application/SGML body parts (the term mime/sgml is used for
convenience where the particular content type does not matter).
The second of the three Multipart/SDIF body parts is a
mime/sgml body part.
The Multipart/SDIF content type has a character set parameter
which specifies the character set used for SGML markup tokens
through-out the data stream.
There are five SDIF entity types:
   subdocument      These can contain references to external
                    entities as well as marked up text.
   text             An external entity containing only marked
                    up text.
   data             An external entity containing non-SGML
                    data, images, for example.
   public-text      Corresponds to a PUBLIC external reference
                    and contains a NULL message body. [A
                    reminder to readers without intimate
                    knowledge of SGML, PUBLIC text can be
                    located by SGML processors without further
                    identification.]
   cross-reference  Corresponds to a previously included
                    external entity. This avoids duplicating
                    material previously included. It contains
                    a NULL message body. This document requires,
                    in contrast to [ISO-SDIF], that the
                    referenced body part have already appeared.
                    That requirement enables the user agent to
                    display the SGML documents as they are
                    encountered.
The subdocument and text SDIF entities become Application/SGML
body parts and data entities are encapsulated as the appropriate
MIME content type. The last two entities have null message bodies
and are handled as parameters, public and cross-reference, of an
Application/SDIF content type. The syntax is:
   application-subtype := <existing> / "sdif" sdif-param
   sdif-param := ";" "public" "="
                 <an SGML PUBLIC identifier>
                 / ";" "cross-reference" "="
                 <a previous MIME body part>
   <-- the enclosing Multipart/SDIF body part is
       taken as the root (level 1) for numbering body
       parts -->
SDIF requires the entity name to accompany each entity in the data
stream. When MIME is used to transfer SDIF data streams the
entity name will be the value of the content description field in
each body part.
Since SDIF does not distinguish the parts of a document entity
(declaration, DTD, and instance), when SGML documents are contained
in a Multipart/SDIF message the document is sent as a single body
part. The application can apply default values for unspecified declarations
and DTDs.
Finally, SDIF uses sequential numbers to uniquely identify each
entity, an entity-identifier in [ISO-SDIF], and to locate the
position of the first external entity, a first-identifier, of each
document. These are not necessary when using the methods in this
document but can be derived. Within a Multipart/SDIF message, number
each body part sequentially, starting at 1 with the first
Application/SGML body part. Note that the only Multipart body
part that can be present in a Multipart/SDIF message is
Multipart/Alternative. That will resolve into a single body part
and shall be treated as though it were a non-multipart body part.
The subdocument, text and data entities may, in fact, be
Message/External-body body parts. With the numbering described the
unpacker may, if needed, build a table to translate body parts into
SDIF entity numbers.
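The numbering can be sketched in Python as below. The text leaves
some details open, so this sketch makes its own assumptions: body
parts are visited depth first, every multipart container other than
Multipart/Alternative is descended into without being numbered, and
all remaining body parts are numbered from 1. The Part class and the
function name number_body_parts are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    """A body part reduced to its content type and enclosed parts."""
    content_type: str
    children: list = field(default_factory=list)


def number_body_parts(root):
    """Build a table of sequential number -> body part."""
    table = {}

    def visit(part):
        ctype = part.content_type.lower()
        is_container = (ctype.startswith("multipart/")
                        and ctype != "multipart/alternative")
        if is_container:
            # Containers are not numbered themselves (assumption).
            for child in part.children:
                visit(child)
        else:
            # Multipart/Alternative counts as one body part.
            table[len(table) + 1] = part

    visit(root)
    return table


stream = Part("Multipart/SDIF", [
    Part("Application/SGML"),
    Part("Multipart/Alternative", [Part("Image/gif"), Part("Image/tiff")]),
    Part("Application/SDIF"),
])
entity_table = number_body_parts(stream)
```

The resulting table gives the unpacker the translation from body
parts to SDIF entity numbers.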
6 Security
An SGML parser can be directed to invoke a local process, usually
to format or display a graphical image. That capability presents
an opportunity for abuse. To understand the potential problems
requires understanding two SGML constructs, entity and notation
statements, presented below. Capitalized items are literals,
lowercase ones are tokens, and the special characters are markup
escape sequences.
<!ENTITY name SYSTEM file NDATA type>
<!NOTATION type SYSTEM qstring>
The document text will refer to name which, in turn, will cause
the application, type, represented by qstring to be invoked.
Qstring could be the DOS command "delete *.*".
To eliminate potential problems it is recommended that the
unpacker replace each NOTATION statement contained within the message
with the appropriate statement for the recipient's environment. An
implementation may use a local configuration file that identifies
the acceptable types and inform the user of types in the message
that are not available in the local environment. They could be
replaced by a no-operation NOTATION statement. It is recommended
that the list of acceptable types be drawn from the MIME set of
types and subtypes.
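A minimal Python sketch of this recommendation follows. It assumes
NOTATION statements with a quoted SYSTEM literal and notation types
written as MIME content types. The LOCAL_HANDLERS table stands in for
the local configuration file, and its viewer command, the empty
no-operation string, and the function name are all hypothetical.

```python
import re

NOTATION = re.compile(
    r'<!NOTATION\s+(\S+)\s+SYSTEM\s+(["\'])(.*?)\2\s*>', re.S)

# Stand-in for a local configuration file: acceptable notation types
# and the local command for each (hypothetical values).
LOCAL_HANDLERS = {"image/gif": "local-gif-viewer"}
NO_OP = ""  # assumed to be treated as "do nothing" by the local parser


def sanitize_notations(sgml_text, handlers=LOCAL_HANDLERS):
    """Replace every sender-supplied command with a local one."""
    unavailable = []

    def replace(match):
        notation_type = match.group(1)
        command = handlers.get(notation_type.lower())
        if command is None:
            # Remember the type so the user can be informed.
            unavailable.append(notation_type)
            command = NO_OP
        return f'<!NOTATION {notation_type} SYSTEM "{command}">'

    return NOTATION.sub(replace, sgml_text), unavailable


received = (
    '<!NOTATION image/gif SYSTEM "delete *.*">\n'
    '<!NOTATION x-cad SYSTEM "run-cad">'
)
cleaned, missing = sanitize_notations(received)
```

The sender's command string is never executed or kept; only the
notation type survives the rewrite.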
SGML also provides for sending non-interpreted data to the display
device or typesetter. The security hazard presented is similar to
that posed by the use of PostScript. Greater threats may be
posed by more "powerful" display systems and typesetters.
Unauthorized access to the recipient's system and resources may be
possible.
7 References
[ISO-SGML] ISO 8879:1986, Information processing -- Text and
office systems -- Standard Generalized Markup Language
(SGML).
[ISO-SDIF] ISO 9069:1988, Information Processing - SGML Support
Facilities -- SGML Document Interchange Format (SDIF).
[RFC-822] Crocker, D., Standard for the Format of ARPA Internet
Text Messages, August 1982, University of Delaware, RFC
822.
[RFC-HDRC] Moore, Keith, Representation of Non-ASCII Text in
Internet Message Headers, June 1992, RFC 1522.
[RFC-MIME] Borenstein, N. and Freed, N., MIME (Multipurpose
Internet Mail Extensions): Mechanisms for Specifying
and Describing the Format of Internet Message Bodies,
June 1992, RFC 1521.
[US-ASCII] Coded Character Set -- 7-Bit American Standard Code for
Information Interchange, ANSI X3.4-1986.
8 Acknowledgements
The author acknowledges Andy Gelsey, Accurate Information Systems,
Inc., Nathaniel Borenstein, Bellcore, Einar Stefferud, Network
Management Associates, Inc., John Klensin, MIT, and Erik Naggum,
for their suggestions, explanations, and encouragement. No errors
or faults in this document can be ascribed to them; they all
belong to me.
UNIX is a registered trademark of UNIX System Laboratories, Inc.
9 Author's Address
Ed Levinson
elevinson@accurate.com
Accurate Information Systems, Inc.
2 Industrial Way
Eatontown, NJ 0772
Appendix A. SGML for IETFers
This appendix describes the elements of the Standard
Generalized Markup Language (SGML) that are key to understanding
the relationship between SGML and the Multipurpose Internet Mail
Extensions (MIME). For the purposes of this discussion, and
without doing too much damage to the SGML specification, an SGML
document contains text, markup, and references to non-text
document elements (e.g., graphics). For a complete and accurate
description see ISO 8879, Information Processing - Text and office
systems - Standard Generalized Markup Language (SGML).
An SGML document has the following structure (the parenthesized
numbers refer to productions in ISO 8879) and is processed by an
application called an SGML parser. Note that Internet style ABNF
is used for notation here; [ISO-SGML] uses a different style.
   sgml-doc     ::= sgml-decl dtd doc-inst     (2)
   sgml-sub-doc ::= dtd doc-inst               (3)
   Sgml-decl defines the various elements and parameters of SGML,
             for example, the characters that introduce and end
             markup tags ("<" and ">", respectively, will be used
             here), the maximum length of markup tags, etc.
   Dtd       is a document type definition (DTD) which defines the
             structure of the document. Most important for
             interchange considerations, the DTD contains references
             to external files, system commands, and text to be sent
             directly to a typesetter or printer.
   Doc-inst  is the actual document text; it includes graphic
             elements and other text, with or without markup, by
             reference to DTD elements.
The remainder of this discussion focuses on two elements which a
DTD uses to reference other things, entities and notations. They
appear in the DTD and have the following format.
   entity   ::= "<!" "ENTITY" name e-text ">"                   (101)
   e-text   ::= q-string | data | b-text | external             (105)
   data     ::= ( "CDATA" | "SDATA" | "PI" ) q-string           (106)
   external ::= ext-id ( "SUBDOC" | ( "NDATA" type ) )          (108)
   ext-id   ::= ( "SYSTEM" q-string )
                | ( "PUBLIC" pub-id [ q-string ] )              (73)
   notation ::= "<!" "NOTATION" type ext-id ">"                 (148)
where name is a character string, and the definition of b-text is left
to ISO 8879; for convenience q-string has been substituted for the
SGML term parameter literal. Entities referred to via the SUBDOC
keyword differ from SGML documents in that they cannot contain an
sgml-decl.
Using the above productions the following sample entities
demonstrate the important issues. Name, xname, and type are alphanumeric
tokens and q-string is a series of characters enclosed in double
(or single) quote marks.
<!ENTITY name PUBLIC pname> (A)
<!ENTITY name SYSTEM fname> (B)
<!ENTITY name SYSTEM fname NDATA type> (C)
<!NOTATION type SYSTEM command> (D)
<!ENTITY name PI q-string> (E)
Form A refers to a well known or "public" name that the SGML
parser is able to resolve; in the marked up text there will be a
markup item "&name" that directs the parser to include the
corresponding public file. Similarly, form B corresponds to a
locally known file. Form C allows the markup text to refer to
non-SGML data, an image for example, and the type parameter must match
the type of a NOTATION element. The matching element's command
parameter specifies the command which processes the file fname.
Finally form E, processing instructions, specifies a string of
characters to be sent directly to the output device.
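For readers who prefer code, the five forms can be told apart
mechanically. The Python sketch below classifies a single declaration
by the keywords it contains. It is a simplification that does not
parse SGML: it handles only declarations shaped like the samples
above, and the function name classify is an assumption made for this
example.

```python
import re

DECLARATION = re.compile(
    r"<!(ENTITY|NOTATION)\s+(\S+)\s+(PUBLIC|SYSTEM|PI)\s+(.*?)\s*>", re.S)
# A file name (quoted or bare) followed by NDATA and a type.
NDATA_TAIL = re.compile(r"(?:([\"']).*\1|[^\s\"']+)\s+NDATA\s+\S+", re.S)


def classify(declaration):
    """Return the form letter A-E, or None if the shape is unknown."""
    match = DECLARATION.fullmatch(declaration.strip())
    if match is None:
        return None
    kind, _name, keyword, rest = match.groups()
    if kind == "NOTATION":
        return "D" if keyword == "SYSTEM" else None
    if keyword == "PUBLIC":
        return "A"
    if keyword == "PI":
        return "E"
    # SYSTEM entity: form C when NDATA follows the file name, else B.
    return "C" if NDATA_TAIL.fullmatch(rest) else "B"


forms = [classify(d) for d in (
    "<!ENTITY name PUBLIC pname>",
    "<!ENTITY name SYSTEM fname>",
    "<!ENTITY name SYSTEM fname NDATA type>",
    "<!NOTATION type SYSTEM command>",
    "<!ENTITY name PI q-string>",
)]
```

A packer could use such a classification to decide which
declarations need a file name token (B and C) and which need their
command replaced (D).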
These examples give rise to the following issues when the document
is transferred from one environment to another.
A Is the public name known to the recipient? The recipient SGML
parser may not know of the public file and this will be
discovered when it processes the document.
B What is the file name on the recipient system? There must be
some process which binds the sender's file names to the
recipient's.
C See B and D.
D Direct use of the NOTATION form is a large security risk, an
invitation to a Trojan Horse attack. The recipient must be
protected from a sender invoking an arbitrary command on the
recipient system.
E Processing instructions permit the sender to manipulate the
recipient output device. This is the same risk that exists for
PostScript documents and is not addressed.
Appendix B. Content-Type registrations
_________________________________
B.1 The Application/SGML Content-Type
(1) MIME type name: Application
(2) MIME subtype name: SGML
(3) Required parameters: none
(4) Optional parameters: declaration, dtd, instance, fosi, charset
(5) Encoding considerations: may be encoded
(6) Security considerations: see RFC section 6
(7) Specification:
This subtype is used for text marked with the Standard Generalized
Markup Language. Body parts of this subtype will contain a
Content-Reference: field if this body part is referred to as a
file by an SGML document or subdocument entity or if it is
explicitly referred to in a Multipart/SGML parameter.
_________________________________
B.2. The Application/SDIF Content-Type
(1) MIME type name: Application
(2) MIME subtype name: SDIF
(3) Required parameters: one of public or cross-reference
(4) Optional parameters: none
(5) Encoding considerations: none
(6) Security considerations:
(7) Specification:
This subtype contains a NULL or empty message body. The value of
the public parameter is an SGML PUBLIC entity identifier. The
value of cross-reference is the body part identifier of a
previously occurring body part.
_________________________________
B.3. The Multipart/SGML Content-Type
(1) MIME type name: Multipart
(2) MIME subtype name: SGML
(3) Required parameters: boundary
(4) Optional parameters: declaration, dtd, fosi, instance
(5) Encoding considerations: none
(6) Security considerations: see RFC section 6
(7) Specification:
_________________________________
B.4. The Multipart/SDIF Content-Type
(1) MIME type name: Multipart
(2) MIME subtype name: SDIF
(3) Required parameters: boundary
(4) Optional parameters: charset
(5) Encoding considerations: none
(6) Security considerations: none